What is Maximum Likelihood?

What IWLS Does

IWLS (iteratively weighted least squares) is the computational method used to fit GLMs. It is essentially a clever way to find the maximum likelihood estimates (MLEs) by solving a sequence of weighted least squares problems.

The process:

  1. Start with an initial guess for the model parameters (e.g., assume all coefficients are zero).

  2. Linearize the model using a first-order Taylor expansion (this is like unwrapping the curve so it looks linear).

  3. Weight the data points based on how reliable they are:

     - Points with low variance get higher weight (more trust).

     - Points with high variance get lower weight (less trust).

  4. Solve this weighted least squares problem to update parameter estimates.

  5. Iterate: Use the new estimates, recompute weights, and repeat steps 2–4.

  6. Stop when changes between iterations are very small (convergence).

This iterative approach allows us to handle complex likelihood surfaces for non-normal data.
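The steps above can be sketched in a few lines of R. This is a minimal illustration on simulated data (the variable names and simulated values are invented for the example), fitting a Poisson GLM with a log link by hand; glm() implements a more careful version of the same algorithm.

```r
# Minimal IWLS sketch for a Poisson GLM with a log link (illustration only)
set.seed(1)
x <- runif(50, 1, 5)
y <- rpois(50, exp(0.3 + 0.5 * x))        # simulated count data
X <- cbind(1, x)                          # design matrix
beta <- c(0, 0)                           # step 1: initial guess of zero
for (i in 1:25) {
  eta <- as.vector(X %*% beta)            # linear predictor
  mu  <- exp(eta)                         # inverse link
  z   <- eta + (y - mu) / mu              # step 2: linearized 'working' response
  w   <- mu                               # step 3: weights (Poisson variance = mu)
  beta_new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))  # step 4: weighted LS
  if (max(abs(beta_new - beta)) < 1e-8) break            # step 6: convergence
  beta <- as.vector(beta_new)             # step 5: iterate with new estimates
}
round(beta, 4)
round(coef(glm(y ~ x, family = poisson)), 4)  # should agree closely
```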



Technical Details


Fit Standard OLS Model and Compare It To Maximum Likelihood Model

#janka is pronounced 'Yangka'
data(janka) # load data
str(janka) # look at properties of the dataframe
## 'data.frame':    36 obs. of  2 variables:
##  $ Density : num  24.7 24.8 27.3 28.4 28.4 29 30.3 32.7 35.6 38.5 ...
##  $ Hardness: int  484 427 413 517 549 648 587 704 979 914 ...


OLS Model to Minimize Sum of Squared Residuals

The Janka hardness test, created by Austrian-born American researcher Gabriel Janka (1864–1932), measures the resistance of a sample of wood to denting and wear. It measures the force required to embed a 0.444 in (11.28 mm) steel ball halfway into a sample of wood. This ball diameter was chosen so that the circle of contact, when the ball is embedded to half its diameter, has an area of 100 square millimetres, or one square centimetre.

A common use of Janka hardness ratings is to determine whether a species is suitable for use as flooring.



Based on the sampled species above, it looks like the hardest Brazilian hardwoods are much harder than you might expect from a straight-line trend. The pattern looks ‘non-linear’ and almost seems exponential at the upper (hardest) end.

#Timber hardness measured on the 'Janka' scale
plot_ols <-
  ggplot(janka, aes(x=Density, y=Hardness)) + 
  geom_point()+
  geom_smooth(method=lm, se=FALSE) +
  geom_smooth(color = "orange", se = FALSE) +
  ggtitle("Fiber density vs. wood hardness")

plot_ols

## Normal least squares linear regression using lm(): 
janka.ls1 <- lm(Hardness~Density, data = janka)

summary(janka.ls1)
## 
## Call:
## lm(formula = Hardness ~ Density, data = janka)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -338.40  -96.98  -15.71   92.71  625.06 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1160.500    108.580  -10.69 2.07e-12 ***
## Density        57.507      2.279   25.24  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 183.1 on 34 degrees of freedom
## Multiple R-squared:  0.9493, Adjusted R-squared:  0.9478 
## F-statistic:   637 on 1 and 34 DF,  p-value: < 2.2e-16




First we need to remind ourselves about the (important) assumptions of linear modeling: linearity, constant variance (homoscedasticity), normally distributed residuals, and independence of observations.

For our purposes here, the first two (linearity and constant variance) are the most relevant.


Check the assumptions of the fitted linear model. I suggest a better approach than the autoplot() function from ggfortify that we have used before.

performance::check_model(janka.ls1) #easystats package

There are two subtle problems with the OLS model: the relationship shows some curvature (non-linearity), and the residual variance increases with the mean (non-constant variance).

One way to deal with these issues is to transform the response variable. But how should we transform the response? What mathematical function should we use?

Another way to deal with these issues is to perform the transformation directly within the linear model specification (we will do this later using a GLM).


Maximum Likelihood Model to Find Best Data Transformation

Figure 8.1 A likelihood surface for the Box–Cox function applied to the simple linear regression of the Janka wood hardness data.

Box Cox transformations are done by raising each point of the response variable (y) to a power (lambda, \(\lambda\)) that is systematically varied. For example, when \(\lambda\)=2, the data are squared, when \(\lambda\)=0.5 they are square rooted, and so on. In Box-Cox transformations, only the response variable (y) is transformed to improve model assumptions, while the predictor variables (x) remain unchanged.
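The Box-Cox family has a standard definition that makes the role of \(\lambda\) explicit. Here is a minimal hand-rolled sketch of that textbook formula (written out for illustration rather than taken from a package):

```r
# Box-Cox transform: (y^lambda - 1)/lambda, with log(y) as the lambda = 0 limit
bc <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
bc(c(1, 4, 9), 0.5)  # behaves like a (shifted, scaled) square-root: 0 2 4
bc(c(1, 4, 9), 1)    # linear: just shifts each value down by 1: 0 3 8
bc(c(1, 4, 9), 0)    # natural log
```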

boxcox(janka.ls1) # boxcox() from the MASS package; plots the likelihood surface for lambda

The peak \(\lambda\) (transformation value) looks to be around 0.5 with confidence intervals definitely spanning 0.5.

The R output for the boxcox() function plots the maximum likelihood surface (the curve) together with a maximum likelihood-based 95% CI.

The 95% CI is produced by dropping down 1.92 log-likelihood units from the maximum likelihood value, moving horizontally left and right (the dotted line labelled 95% in Fig. 8.1) until the likelihood surface is met, and then dropping down from these points, as shown by the outer vertical dotted lines in the same figure. The value of 1.92 is half the critical value of chi-squared with 1 DF at the P = 0.05 level: the deviance (the generalized equivalent of SS) is distributed approximately as chi-squared, and the likelihood-ratio statistic is twice the difference in log-likelihoods, hence the division by two.
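You can confirm the 1.92 directly in R: it is half the 5% critical value of chi-squared with one degree of freedom.

```r
qchisq(0.95, df = 1)      # chi-squared critical value: 3.841459
qchisq(0.95, df = 1) / 2  # the drop used for the 95% CI: 1.920729
```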


So how do we interpret \(\lambda\) (ML-estimated transformation value)?

Since any number raised to the power of zero equals one (which would turn every data point into 1), a special case is defined that integrates smoothly with the other transformations: \(\lambda\)=0 is the natural log transformation.

The data raised to the power one are untransformed.

If lambda is two, then y′ = y² (each value is squared).

If lambda is three, then y′ = y³ (each value is cubed).

For negative values of lambda, we use the reciprocals.

Original Data and Powers of y

Untransformed (y)    \(y^2\)     \(y^3\)
 2.00                  4.00         8.00
 5.64                 31.81       179.41
 9.26                 85.75       794.02
14.26                203.35      2899.74
34.21               1170.32     40036.79



Common Box-Cox Transformations

Lambda (λ)   Transformed data (Y′)
  -3         Y^-3 = 1/Y^3
  -2         Y^-2 = 1/Y^2
  -1         Y^-1 = 1/Y
  -0.5       Y^-0.5 = 1/√Y
   0         log(Y) **
   0.5       Y^0.5 = √Y
   1         Y^1 = Y (no change)
   2         Y^2
   3         Y^3

Note:
** The transformation for λ = 0 is defined as log(Y); otherwise every data point would transform to Y⁰ = 1.
The transformation doesn’t always work well, so make sure you check your data after the transformation with a normal probability plot.

Reference:
Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211–252.


What do these different transformations look like?

The solid black line graphs the Box-Cox transformation for λ = 1, which is just x → x − 1. It merely shifts the center of the batch to 0 (as do all the Box-Cox transformations). The upward-curving red line is for λ = 2. The downward-curving lines show, in order of increasing curvature, the smaller values of λ down to −1.


The results of the Box–Cox transformation of Janka hardness data suggest that a value of lambda of around 0.5 will give the best fit to the data so we can proceed using the square-root transformation, or rather the GLM equivalent.
Video explaining calculation of lambda: https://www.youtube.com/watch?v=vGOpEpjz2Ks&t=678s

Let’s remind ourselves what the original data look like, plotted next to the transformed (\(\lambda\)=0.5, a.k.a. square-root transformed).

#SQRT Timber hardness measured on the 'Janka' scale
plot_sqrt <-
  ggplot(janka, aes(x = Density, y = sqrt(Hardness))) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, color = "green") +
  geom_smooth(color = "orange", se = FALSE) +
  ggtitle("Fiber density vs. (sqrt) wood hardness")

plot_ols + plot_sqrt #use patchwork to show ols and sqrt figures side-by-side

We noted previously that another way to deal with the curvature and non-constant variance issues is to perform the transformation directly within the linear model specification; we will do this using a GLM.





Basics of Generalized Linear Models

The general form of the GLM is:

glm(formula, family=familytype(link=linkfunction), data=)

GLMs were first proposed in an article by Nelder and Wedderburn in 1972 (J. R. Statist. Soc. A, 135: 370). The 1983 book Generalized Linear Models by McCullagh and Nelder (second edition, 1989) is the standard reference.

GLMs have three components: a linear predictor, a variance function, and a link function.

The first is the most familiar. In R, the linear predictor is what comes after the tilde (~) in our linear model formula. In the Janka example above it is wood density.

The variance function models the variation in the data. This is also familiar to us since OLS uses the normal distribution to model the residual variation. The difference here is that GLMs are not restricted to the normal distribution but make use of a much wider range of distribution ‘families’ including the poisson, the binomial, and the gamma.

The third component is the least familiar. The link function plays a role equivalent to the transformation in normal least squares models. However, rather than transforming the data, we transform the predictions made by the linear predictor. Commonly used link functions include the log, square root, and logit.
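To see that the link function transforms predictions rather than data, compare predict() on the link and response scales. A small sketch with simulated count data (the data here are invented for illustration):

```r
# With a log link, predictions on the link scale are the logs of those on the
# response scale -- the raw data themselves are never transformed.
set.seed(2)
dat <- data.frame(x = 1:20)
dat$y <- rpois(20, exp(0.1 * dat$x))
m <- glm(y ~ x, data = dat, family = poisson(link = "log"))
head(predict(m, type = "link"))       # linear predictor (log scale)
head(predict(m, type = "response"))   # back-transformed fitted counts
all.equal(exp(predict(m, type = "link")),
          predict(m, type = "response"))  # TRUE: exp() undoes the log link
```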

GLMs allow for response variables with non-normal distributions, making them flexible for ecological data. The choice of error structure (distribution + link function) depends on the type of response variable:

Response Type               Distribution (Family)   Link Function     Typical Use
Count                       Poisson                 Log               Abundance, number of individuals
Overdispersed count         Negative binomial       Log               Abundance with high variance
Binary (presence/absence)   Binomial                Logit or probit   Species occurrence
Proportions                 Binomial                Logit             Survival rate, cover
Continuous, positive        Gamma                   Log or inverse    Biomass, energetic rates
Continuous, unbounded       Gaussian                Identity          Temperature, body size

To see how this works, let’s fit the Janka model using both OLS and GLM.

janka.ls1 <- lm(Hardness~Density, data= janka )
anova(janka.ls1)

We can fit the same model using the GLM function:

janka.ml1 <- glm(Hardness~Density, data= janka, family= gaussian(link="identity"))
anova(janka.ml1)

Notice that in the model specification above, family = gaussian and link = "identity" are the defaults. As with other R code, we can omit default arguments, but it is probably best to include them until we understand the models fully.

Gaussian is another word for ‘normal’; the identity link performs no transformation, so the fitted values from the linear predictor are used unchanged.

Notice that when using the anova() function, we get different information for lm vs glm model specification. ANOVA of the lm gives a least squares ANOVA table, whereas the glm gives a maximum likelihood analysis of deviance (ANODEV). These outputs look more different than they actually are.

Recall that for GLMs using the normal (gaussian) distribution, the deviance is the SS (sum of squared residuals). So the values for the deviance here (based on likelihood) match the conventional ANOVA table. However, there is no equivalent of the mean squares in the analysis of deviance.
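We can verify this equivalence directly, using the janka.ls1 and janka.ml1 models fitted above:

```r
# Gaussian GLM with identity link: residual deviance = OLS residual SS
deviance(janka.ml1)      # residual deviance from the glm fit
sum(resid(janka.ls1)^2)  # residual sum of squares from the lm fit -- same number
all.equal(deviance(janka.ml1), sum(resid(janka.ls1)^2))  # TRUE
```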

Quick question: What are mean squares used for? What do they tell us?

The point here is to demonstrate that the glm() function with the Gaussian variance function and identity link performs an analysis equivalent to the lm() function. Now we can see how to take fuller advantage of its greater flexibility.

We need to model the mean and the variance. We know that in the normal least squares analysis for the Janka data, the best fit was produced by the square-root transformation, as recommended by the Box–Cox results. However, with this transformation the variance increased as the mean increased (slightly).


We can have the best of both worlds in our GLM by using a square-root link function in combination with a distribution in which the variance increases with the mean. It turns out that for the gamma distribution the variance increases as the square of the mean. We can fit a GLM with a square-root link and gamma distribution variance function as follows:

janka.gamma <- glm(Hardness~Density, data= janka, family=Gamma(link= "sqrt"))
#summary(janka.gamma)

ggplot(janka,aes(Density,Hardness)) + 
  geom_point() +
  geom_smooth(method="glm",colour="red",
                         method.args=list(family="Gamma"(link="sqrt")))+
  labs(title="GLM, square-root link, Gamma variance")
## `geom_smooth()` using formula = 'y ~ x'


Figure 8.3 might help relate the GLM approach using a link function to the linear model analysis of the square-root transformed data.

We can get CIs for the regression intercept and slope (remember the curve in the figures is linear on the scale of the link function, the square root in this case):

coef(janka.gamma)
## (Intercept)     Density 
##   1.8672884   0.7677963
confint(janka.gamma)
##                  2.5 %    97.5 %
## (Intercept) 0.09706551 3.6699483
## Density     0.72361627 0.8122638

Since confidence intervals are not based on the normal distribution, they are not constrained to be symmetric. However, they are interpreted in the same way.
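Because the model is linear on the square-root scale, a prediction on the data scale is just the linear predictor squared. A quick check using the janka.gamma model fitted above (Density = 50 is an arbitrary example value):

```r
# Predicted Hardness at Density = 50, computed two equivalent ways:
d <- 50
(coef(janka.gamma)[1] + coef(janka.gamma)[2] * d)^2   # square the linear predictor
predict(janka.gamma, newdata = data.frame(Density = d),
        type = "response")                            # same value via predict()
```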


So how do we tell which model is better, the one fit via OLS or the one fit via maximum likelihood?

model_comp <- performance::compare_performance(janka.ls1, janka.gamma)
print_md(model_comp)
Comparison of Model Performance Indices

Name          Model   AIC (weights)    AICc (weights)   BIC (weights)    RMSE     Sigma    R2     R2 (adj.)   Nagelkerke's R2
janka.ls1     lm      481.2 (<.001)    482.0 (<.001)    486.0 (<.001)    177.90   183.06   0.95   0.95        —
janka.gamma   glm     453.0 (>.999)    453.7 (>.999)    457.7 (>.999)    156.13   0.10     —      —           0.98

While comparing these indices is often useful, making a decision (for instance, which model to keep or drop) can often be hard, as the indices can give conflicting suggestions. Additionally, it is sometimes unclear which index to favor in the given context.
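A simpler check of the same comparison is AIC alone, which is what drives the weights in the table above (using the two models fitted earlier):

```r
AIC(janka.ls1, janka.gamma)  # lower AIC (by about 28 units here) favours the gamma GLM
```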

Summary







Visual Explanation of Maximum Likelihood:

https://www.youtube.com/watch?v=XepXtl9YKwc

Maximum likelihood vs least squares in linear regression

https://www.youtube.com/watch?v=bhTIpGtWtzQ

Box-Cox:

https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/

Analogy to help logically interpret link functions: